After being pulled from the CPD data warehouse, the data look like this:
## DATEOCC YEAR MONTH DAY DOW CURR_IUCR FBI_CD AREA BEAT DISTRICT
## 1 2008-01-01 2008 1 1 Tue 0320 03 2 631 6
## 2 2008-01-01 2008 1 1 Tue 0265 02 5 1412 14
## 3 2008-01-01 2008 1 1 Tue 1754 02 1 725 7
## X_COORD Y_COORD LOCATION INC_CNT
## 1 1183288 1850874 304 1
## 2 1152781 1918361 090 1
## 3 1167145 1859291 290 1
A preview of the variables
## 'data.frame': 693175 obs. of 14 variables:
## $ DATEOCC : Date, format: "2008-01-01" "2008-01-01" ...
## $ YEAR : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
## $ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DAY : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DOW : Factor w/ 7 levels "Fri","Mon","Sat",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ CURR_IUCR: Factor w/ 75 levels "0110","0130",..: 20 7 75 14 75 74 75 8 8 75 ...
## $ FBI_CD : Factor w/ 7 levels "01A","02","03",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ AREA : Factor w/ 6 levels "0","1","2","3",..: 3 6 2 2 3 3 5 5 3 6 ...
## $ BEAT : Factor w/ 305 levels "111","112","113",..: 71 172 83 101 269 54 138 169 67 212 ...
## $ DISTRICT : Factor w/ 26 levels "1","2","3","4",..: 6 14 7 8 22 5 11 13 6 17 ...
## $ X_COORD : int 1183288 1152781 1167145 1160916 1173738 1182361 1150832 1162552 1171606 1148547 ...
## $ Y_COORD : int 1850874 1918361 1859291 1859682 1836987 1838427 1899022 1900718 1853535 1929621 ...
## $ LOCATION : Factor w/ 99 levels "","090","092",..: 87 2 79 79 79 79 79 79 79 79 ...
## $ INC_CNT : int 1 1 1 1 1 1 1 1 1 1 ...
A summary of the data
## DATEOCC YEAR MONTH DAY
## Min. :2008-01-01 Min. :2008 Min. : 1.000 Min. : 1.00
## 1st Qu.:2009-06-20 1st Qu.:2009 1st Qu.: 4.000 1st Qu.: 8.00
## Median :2011-02-19 Median :2011 Median : 7.000 Median :16.00
## Mean :2011-03-25 Mean :2011 Mean : 6.513 Mean :15.66
## 3rd Qu.:2012-11-20 3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :2014-12-31 Max. :2014 Max. :12.000 Max. :31.00
##
## DOW CURR_IUCR FBI_CD AREA BEAT
## Fri: 98816 0486 :210289 01A: 3036 0 : 3 421 : 6896
## Mon: 94869 0460 :148648 02 : 12482 1 :199558 624 : 6674
## Sat:105105 0560 : 99396 03 : 99219 2 :201478 423 : 6650
## Sun:108317 0320 : 36399 04A: 36279 3 :130101 511 : 5567
## Thu: 95168 031A : 35239 04B: 60408 4 : 77124 612 : 5450
## Tue: 94597 0430 : 20262 08A:109131 5 : 84889 (Other):661920
## Wed: 96303 (Other):142942 08B:372620 NA's: 22 NA's : 18
## DISTRICT X_COORD Y_COORD LOCATION
## 7 : 52718 Min. :1094469 Min. :1813932 303 :136428
## 11 : 47998 1st Qu.:1152995 1st Qu.:1856702 090 :125008
## 6 : 47268 Median :1166410 Median :1878481 304 :117738
## 4 : 46469 Mean :1165222 Mean :1881764 290 :110815
## 8 : 45469 3rd Qu.:1177026 3rd Qu.:1906159 314 : 25288
## (Other):453235 Max. :1205097 Max. :1951533 092 : 20512
## NA's : 18 (Other):157386
## INC_CNT
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
##
The summary of how the crime counts are distributed in each area
##
## 0 1 2 3 4 5 <NA>
## 3 199558 201478 130101 77124 84889 22
and in each district
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 13929 25707 42677 46469 38309 47268 52718 45469 34477 34964 47998 18116
## 13 14 15 16 17 18 19 20 21 22 23 24
## 10239 22235 34333 17396 17217 18321 14324 10541 8421 23413 7202 20309
## 25 31 <NA>
## 41098 7 18
Two things stand out: (a) District 31 has only 7 incidents, and (b) Area 0 has only 3 incidents over the 7-year period.
Most of the missing values (22 in AREA, 18 in DISTRICT, and 18 in BEAT) occur in the same rows.
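One way to check whether missing values coincide across attributes is to count, for each row, how many of the three fields are `NA`. The sketch below uses a small made-up data frame (the values and the name `toy` are illustrative only, not the CPD data) with the same column names as above.

```r
# Toy stand-in for the crime data; the values are invented for illustration.
toy <- data.frame(
  AREA     = c("1", NA, "2", NA, "3", NA),
  DISTRICT = c("6", NA, "7", NA, "8", "9"),
  BEAT     = c("631", NA, "725", NA, "811", "912"),
  stringsAsFactors = TRUE
)

# Logical matrix: TRUE where a value is missing.
na_flags <- is.na(toy[, c("AREA", "DISTRICT", "BEAT")])

# Number of missing attributes per row (0 = complete, 3 = all missing).
na_per_row <- rowSums(na_flags)
print(table(na_per_row))

# Row indices where all three attributes are missing at once.
all_missing <- unname(which(na_per_row == 3))
print(all_missing)
```

If most rows with any missing value land in the "3" bin of the table, the missing values share row indices, which is the situation described above.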
The area, district, and beat polygon maps below are drawn from the shapefiles provided by CPD.
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/CPDShapeFiles/", layer: "area_bndy"
## with 8 features and 3 fields
## Feature type: wkbPolygon with 2 dimensions
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/CPDShapeFiles/", layer: "district_bndy"
## with 28 features and 3 fields
## Feature type: wkbPolygon with 2 dimensions
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/xiaomuliu/CrimeProject/SpatioTemporalModeling/CPDShapeFiles/", layer: "beat_bndy"
## with 288 features and 3 fields
## Feature type: wkbPolygon with 2 dimensions
A scatter plot of violent crime locations on a single day (2014-01-01):
Let’s first aggregate the data by policing beat and district to see whether any spatial or temporal pattern exists at those levels. Both plots below examine whether different districts have similar seasonal patterns.
The top plot shows the daily crime time series. Note that the series for districts 13, 21, and 23 appear to be truncated: data for districts 13, 21, and 23 are available only up to 2012/12/16, 2013/03/02, and 2013/03/01, respectively. For the bottom plot, the crime counts were first grouped by year and then aggregated by district and month. Interestingly, seasonal patterns do vary across districts.
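The aggregation and the truncation check described above can be sketched in base R as follows. The data here are simulated (random dates for three districts, with district 13 cut off after 2012-12-16 to mimic the truncation), and the object names `sim`, `monthly`, and `last_day` are illustrative assumptions, not the original analysis code.

```r
set.seed(7)

# Simulated incidents: one row per incident, as in the real data (INC_CNT = 1).
sim <- data.frame(
  DATEOCC  = as.Date("2008-01-01") + sample(0:2556, 2000, replace = TRUE),
  DISTRICT = factor(sample(c("7", "11", "13"), 2000, replace = TRUE)),
  INC_CNT  = 1L
)

# Mimic a truncated series: drop district 13 records after 2012-12-16.
cutoff <- as.Date("2012-12-16")
sim <- sim[!(sim$DISTRICT == "13" & sim$DATEOCC > cutoff), ]

# Derive year and month, then sum incident counts by district, year, and month.
sim$YEAR  <- as.integer(format(sim$DATEOCC, "%Y"))
sim$MONTH <- as.integer(format(sim$DATEOCC, "%m"))
monthly <- aggregate(INC_CNT ~ DISTRICT + YEAR + MONTH, data = sim, FUN = sum)

# Last date with any record in each district (ISO strings sort chronologically).
last_day <- sapply(split(sim$DATEOCC, sim$DISTRICT), function(d) format(max(d)))
print(last_day)
```

A district whose last recorded date falls well before the end of the study period is a candidate for a truncated series.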
Grouping by beat gives a higher-resolution view of the spatial and temporal patterns. However, since there are nearly 300 beats, we use a heat map rather than multi-panel plots to show these patterns.
Again, some beats show a strong, decreasing, periodic seasonal trend while others do not. Crime counts in adjacent beats are usually close.
Now let’s move from regional analysis to city-wide analysis. Here is an incident location plot for the year 2014.
It is difficult to tell whether crime location clusters vary over time just by looking at point plots, so let’s move to grid (pixel)-based analysis. First, the point data were rasterized by binning into a 100 \(\times\) 100 grid; the grid boundaries were defined by the range of the x- and y-coordinates of all available crime locations plus a margin of 1000 units on each side. Below is an example of pixelized violent crime locations in January 2014.
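A minimal base-R sketch of this kind of binning is shown below. The points are drawn uniformly at random within the coordinate ranges reported in the summary above, so they are not real incidents, and the names `x_breaks`, `y_breaks`, and `counts` are illustrative assumptions rather than the original code.

```r
set.seed(42)

# Simulated point locations within the observed X_COORD / Y_COORD ranges.
x <- runif(500, 1094469, 1205097)
y <- runif(500, 1813932, 1951533)

n_cells <- 100   # grid is n_cells x n_cells
margin  <- 1000  # padding added on each side of the bounding box

# Cell edges: n_cells + 1 equally spaced breaks per axis.
x_breaks <- seq(min(x) - margin, max(x) + margin, length.out = n_cells + 1)
y_breaks <- seq(min(y) - margin, max(y) + margin, length.out = n_cells + 1)

# Column and row index of the cell containing each point.
ix <- findInterval(x, x_breaks, rightmost.closed = TRUE)
iy <- findInterval(y, y_breaks, rightmost.closed = TRUE)

# Count points per cell; fixing the factor levels keeps empty cells as zeros.
counts <- table(factor(ix, levels = seq_len(n_cells)),
                factor(iy, levels = seq_len(n_cells)))
```

The resulting `counts` object can be passed to `image()` to display the pixelized map. Because of the margin, every point falls strictly inside the grid, so no incident is dropped at the edges.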
Next, we apply kernel density estimation (KDE) to the monthly aggregates for each year. The kernel is a 2D Gaussian with the same bandwidth in each direction. The bandwidth was selected by cross-validation (minimizing MSE) using all available data (2008 to 2014). The figure below shows the KDE for each month of 2014.
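The text does not spell out the exact cross-validation criterion, so the sketch below substitutes least-squares cross-validation (LSCV), a standard choice that minimizes an estimate of the integrated squared error, for an isotropic 2D Gaussian kernel. The data are simulated from two Gaussian clusters, and the function names `lscv` and `kde_grid` are hypothetical helpers written for this illustration.

```r
set.seed(1)

# Simulated points: two clusters standing in for crime hot spots.
pts <- rbind(cbind(rnorm(150, 0, 1), rnorm(150, 0, 1)),
             cbind(rnorm(150, 5, 1), rnorm(150, 5, 1)))
n  <- nrow(pts)
d2 <- as.matrix(dist(pts))^2   # squared pairwise distances

# LSCV score for bandwidth h: integral of fhat^2 minus twice the
# average leave-one-out density at the data points.
lscv <- function(h) {
  int_f2 <- sum(exp(-d2 / (4 * h^2))) / (4 * pi * h^2 * n^2)
  k <- exp(-d2 / (2 * h^2)) / (2 * pi * h^2)
  diag(k) <- 0                 # leave each point out of its own estimate
  int_f2 - 2 * sum(k) / (n * (n - 1))
}

# Pick the bandwidth with the smallest score on a candidate grid.
h_grid <- seq(0.1, 2, by = 0.05)
scores <- sapply(h_grid, lscv)
h_best <- h_grid[which.min(scores)]

# Evaluate the KDE on a regular grid; the Gaussian product kernel factorizes,
# so the density is a matrix product of the per-axis kernel matrices.
kde_grid <- function(pts, h, gx, gy) {
  kx <- dnorm(outer(gx, pts[, 1], "-"), sd = h)   # length(gx) x n
  ky <- dnorm(outer(gy, pts[, 2], "-"), sd = h)   # length(gy) x n
  kx %*% t(ky) / nrow(pts)
}

gx <- seq(-6, 11, length.out = 101)
gy <- seq(-6, 11, length.out = 101)
dens <- kde_grid(pts, h_best, gx, gy)

# Sanity check: the density should integrate to roughly 1 over the grid.
total_mass <- sum(dens) * (gx[2] - gx[1]) * (gy[2] - gy[1])
```

Computing all pairwise distances costs memory quadratic in the number of points, so for a data set of several hundred thousand incidents the bandwidth search would in practice need subsampling or a binned approximation.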
Below is an animation of the KDE for each year (2008 to 2014). It shows no obvious migration of crime hot spots over the years studied.